class: center, middle, inverse, title-slide

.title[
# Class 3b: Review of concepts in Probability and Statistics
]
.author[
### Business Forecasting
]

---

<style type="text/css">
.remark-slide-content {
  font-size: 18px;
  padding: 16px 64px 16px 64px;
}
.remark-slide-content h2 {
  font-size: 1.4em;
  margin-top: 0.2em;
  margin-bottom: 0.3em;
}
.remark-slide-content h3 {
  font-size: 1.2em;
  margin-top: 0.2em;
  margin-bottom: 0.3em;
}
.remark-slide-content h4 {
  font-size: 1.05em;
  margin-top: 0.1em;
  margin-bottom: 0.2em;
}
.remark-slide-content ul {
  margin-top: 0.2em;
  margin-bottom: 0.2em;
}
.remark-slide-content li {
  margin-bottom: 0.1em;
}
.learnr-ref {
  background: #eef2f7;
  border-left: 3px solid #2980b9;
  padding: 3px 8px;
  font-size: 0.85em;
  margin-top: 0.3em;
  display: inline-block;
}
</style>

---
layout: false
class: inverse, middle

# Confidence Intervals

---
### Confidence Intervals

- We calculated the mean price in our sample
- How confident are we that our estimate is close to the parameter's value?
- Confidence intervals measure uncertainty around the estimate

---
### Confidence Intervals

- Mean price was 1245.43

--

- Is it reasonable to think true average price in population is 1100? What about 2000?

--

- Suppose that we calculated the confidence interval to be:
`$$\{1086.64, 1404.22\}$$`

--

- Where are these numbers coming from?

--

1. The sampling distribution of the sample mean tells us how likely we are to get a point estimate which is far away from the true mean

--

2. The confidence interval uses this property of the sampling distribution to tell us where the true mean might be

--

- Let's go through these statements 1-by-1

---
### Sampling distribution

**Q: How likely is it that a sample mean is far away from the true mean?**

- Consider a hypothetical sampling distribution of a sample mean
- Reminder: `\(\bar{x} \sim \mathcal{N}(\mu, \frac{\sigma}{\sqrt{n}})\)`

--

- If we draw samples repeatedly, 95% of their means will be within the shaded area
- Why 1.96?
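
One way to see where 1.96 comes from is to ask R for the standard normal quantile that leaves 2.5% in each tail. This is a minimal sketch using base R's `qnorm()` and `pnorm()`; the variable names `z` and `coverage` are only illustrative.

```r
# 1.96 is (approximately) the 97.5% quantile of the standard normal:
# 2.5% of the probability lies above it and 2.5% lies below its negative.
z <- qnorm(0.975)
z

# Check: probability that a standard normal Z falls between -z and z
coverage <- pnorm(z) - pnorm(-z)
coverage
```
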
<img src="data:image/png;base64,#C_3_slides_b_files/figure-html/unnamed-chunk-1-1.png" width="100%" />

---
### Yet another way to see it

`\begin{align*}
0.95&= P(-1.96<Z<1.96) \\
&= P(-z_{\frac{\alpha}{2}}<Z<z_{\frac{\alpha}{2}}) \\
&= P\left(-z_{\frac{\alpha}{2}}<\frac{\bar{X}-\mu}{\sigma/\sqrt n}<z_{\frac{\alpha}{2}}\right) \\
&= P\left(-z_{\frac{\alpha}{2}}\cdot\sigma/ \sqrt n<\bar{X}-\mu<z_{\frac{\alpha}{2}}\cdot\sigma/\sqrt n\right) \\
&= P\left(\mu-z_{\frac{\alpha}{2}}\cdot\sigma/\sqrt n< \bar{X} <\mu+z_{\frac{\alpha}{2}}\cdot\sigma/\sqrt n\right)
\end{align*}`

- In theory, the CLT guarantees that `\(\frac{\bar{X}-\mu}{\sigma/\sqrt n}\)` is approximately standard normal in large samples

--

- What happens if you do not know `\(\sigma\)`?

--

- In large samples, `\(s \rightarrow \sigma\)`, so `\(\frac{\bar{X}-\mu}{s/\sqrt n} \rightarrow N(0,1)\)`
- So in large samples, the standardized sample mean (with estimated standard deviation) will also be approximately normal
- You may need a somewhat larger n for `\(s\)` to be close to `\(\sigma\)`

---
### Sampling distribution

**Q: How far is the sampled mean from the true mean?**

- Hence 95% of the draws of sample means will be within a distance of `\(1.96\frac{\sigma_X}{\sqrt n}\)` of the true parameter
- There is only a 5% chance that we have drawn a sample unusual enough that `\(\bar{X}\)` is more than `\(1.96\frac{\sigma_X}{ \sqrt n}\)` away from `\(\mu_X\)`
- The confidence interval around `\(\bar{X}\)` will cover `\(\mu_X\)` as long as `\(|\mu_X-\bar{X}|<1.96\frac{\sigma_X}{\sqrt n}\)`

<img src="data:image/png;base64,#C_3_slides_b_files/figure-html/unnamed-chunk-2-1.png" width="100%" />

---
### Sampling distribution

- Suppose we draw many samples from the same distribution
- For each sample we compute the sample mean and we construct the interval
- About 95% of these intervals will cover the true population mean!
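
The coverage claim can be checked with a small simulation. The sketch below is only illustrative: the population values `mu` and `sigma`, the sample size, the number of repetitions and the seed are all made-up assumptions, not values from the course data.

```r
set.seed(1)              # arbitrary seed, only for reproducibility
mu <- 50; sigma <- 10    # hypothetical population mean and standard deviation
n <- 100; reps <- 10000  # sample size and number of repeated samples

# For each repetition: draw a sample, build the 95% interval, check coverage
covered <- replicate(reps, {
  x <- rnorm(n, mean = mu, sd = sigma)
  se <- sd(x) / sqrt(n)
  lower <- mean(x) - 1.96 * se
  upper <- mean(x) + 1.96 * se
  lower < mu && mu < upper
})

mean(covered)  # share of intervals that contain mu; should be close to 0.95
```
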
<iframe src="https://seeing-theory.brown.edu/frequentist-inference/index.html#section2" width="100%" height="400px" data-external="1"></iframe>

Source: [Seeing Theory](https://seeing-theory.brown.edu/frequentist-inference/index.html#section2)

---
### Calculation Procedure

Use this procedure if `\(n>40\)`

1. Take an IID sample

--

2. Calculate mean `\(\bar{x}\)` and standard deviation `\(s\)` in your sample
    - Standard Error is standard deviation of the estimator `\(\small SE=\frac{s}{\sqrt n}\)`

--

3. Pick confidence level (usually 90%, 95% or 99%)
    - We typically denote the confidence level `\(1-\alpha\)`
    - `\(\alpha\)` is probability of making a Type 1 error (more about it later)
    - .blue[Example]: if confidence level is 95%, `\(\small \alpha=0.05\)`

--

4. Find the corresponding critical values `\(\small z_{\frac{\alpha}{2}}\)`
    - Critical values are such that `\(\small P(-z_{\frac{\alpha}{2}}<Z<z_{\frac{\alpha}{2}})=1-\alpha\)`
    - .blue[Example]: if confidence level is 95%, `\(\small z_\frac{\alpha}{2}=z_{0.025}=1.96\)`

--

5. 
Construct the confidence interval as:
`$$\small \left\{\bar{x}- z_{\frac{\alpha}{2}}\cdot\underbrace{\frac{s}{\sqrt{n}}}_{SE},\;\; \bar{x}+ z_{\frac{\alpha}{2}}\cdot\frac{s}{\sqrt{n}}\right\}$$`

---
### Finding Critical Values

- Suppose confidence level is 99%, so `\(\alpha=0.01\)`
- We are looking for `\(z_{\frac{\alpha}{2}}\)` such that:

.learnr-ref[**LearnR:** Finding critical values in normal - use `qnorm()` in R]

`$$P(-z_{\frac{\alpha}{2}}<Z<z_{\frac{\alpha}{2}})=0.99$$`

<img src="data:image/png;base64,#C_3_slides_b_files/figure-html/unnamed-chunk-3-2.png" width="100%" />

--

`$$P(Z>z_{0.005})=0.005 \qquad\text{or}\qquad P(Z<z_{0.005})=0.995$$`

---

<iframe src="https://www.mathsisfun.com/data/standard-normal-distribution-table.html" width="100%" height="480px" data-external="1"></iframe>

Source: [Maths is Fun](https://www.mathsisfun.com/data/standard-normal-distribution-table.html)

---
### Finding Critical Values

`\(P(Z<z_{\frac{\alpha}{2}})=0.995\)`

`\(z_{\frac{\alpha}{2}}\)` is the 99.5% quantile of the standard normal `\(\rightarrow\)` `\(z_{\frac{\alpha}{2}}=2.58\)`

<img src="data:image/png;base64,#C_3_slides_b_files/figure-html/unnamed-chunk-4-2.png" width="100%" />

---
### Constructing CI: example

Let's calculate a 90% CI for the average price of listings with cleanliness score > 4.5

1. Take an IID sample
    - `\(n=100\)` `\(\checkmark\)`

--

2. Calculate mean `\(\bar{x}\)` and standard deviation `\(s\)`
    - `\(\bar{x}=\)` 1245.43 and `\(s=\)` 961.9

--

3. Pick confidence level
    - We pick 90%, so `\(\alpha=0.1\)`

--

4. Find the corresponding critical values `\(z_{\frac{\alpha}{2}}\)`
    - Find `\(z_\frac{\alpha}{2}\)` such that `\(P(Z>z_{\frac{\alpha}{2}})=0.05\)` (or `\(P(Z<z_{\frac{\alpha}{2}})=0.95\)`)
    - `\(z_{0.05}=1.65\)`

--

5. 
Construct the confidence interval as:
`$$\small \left\{\bar{x}- z_{\frac{\alpha}{2}}\cdot\frac{s}{\sqrt{n}},\;\; \bar{x}+ z_{\frac{\alpha}{2}}\cdot\frac{s}{\sqrt{n}}\right\}$$`
`$$\small \left\{1245.43- 1.65\cdot\frac{961.9}{\sqrt{100}},\;\; 1245.43+ 1.65\cdot\frac{961.9}{\sqrt{100}}\right\}$$`

---
### Interpreting confidence intervals

`$$\small CI_{90}=\{1086.64, 1404.22\}$$`

How do we interpret the 90% confidence interval we computed?

- **Correct Interpretation**
  - We are 90% confident that the interval captures the true mean
  - We are 90% confident that the true mean price of listings with cleanliness score > 4.5 is between 1086.64 and 1404.22

--

- **Incorrect**
  - With 90% probability the true mean is between 1086.64 and 1404.22
  - The computed interval is not random and the true mean is not random, so we can't make probabilistic statements about them.
  - The interval is a function of random variables only **before** we draw a sample and make any computation.
  - After we have a sample, nothing is random. The true mean is either between 1086.64 and 1404.22 or not.

---

<center>
<img src=airbnb_True.png width="800">
</center>

---
### Shape of confidence intervals

CI `\(= \left\{\bar{x}\pm z_{\frac{\alpha}{2}}\cdot\frac{s}{\sqrt{n}}\right\}\)` is wider when:

- Confidence level is higher (99% is wider than 90%)
- `\(n\)` is small
- The standard deviation ( `\(\sigma\)`, estimated by `\(s\)` ) is large

<iframe src="https://seeing-theory.brown.edu/frequentist-inference/index.html#section2" width="100%" height="380px" data-external="1"></iframe>

---
### Practice

In learnr:

0. Set seed to your student id
1. Take a sample of 100 observations from the data
2. Calculate the mean and standard deviation of the sample
3. Find critical value
4. Compute confidence interval

.learnr-ref[**LearnR:** Practice - Build a 90% CI for ride-hailing fares and reveal the true mean]

---
### What critical values?

When should we use critical values from Normal Distribution?

1. 
Original distribution (of `\(X\)`) is not normal:
    - If `\(n>30\)`: use critical values from the normal distribution (use `\(n>40\)` if `\(\sigma\)` is unknown, to give `\(s\)` enough data to approximate `\(\sigma\)`)
    - If `\(n<30\)`: the normal approximation is unreliable; you need specialized methods (e.g., bootstrap)

--

2. Original distribution (of `\(X\)`) is normal:
    - If you know `\(\sigma\)`, you can use critical values from normal ( `\(n\)` doesn't matter)
        - If `\(X\)` is normal, then use `\(\sigma\)` instead of `\(s\)` and `\(\frac{\bar{X}-\mu}{\frac{\sigma}{\sqrt n}} \sim N(0,1)\)`
    - If you don't know `\(\sigma\)` but `\(n>40\)`, you can use critical values from normal
        - `\(s\)` is then close to `\(\sigma\)`, so the large-sample normal approximation applies
    - If you don't know `\(\sigma\)` and `\(n<40\)`, you use critical values from .blue[Student's t].
        - `\(\frac{\bar{X}-\mu}{\frac{s}{\sqrt n}}\)` is not normal. `\(s\)` is not a good approximation of `\(\sigma\)` when `\(n\)` is low

---
### What's Student's t?

If `\(X_1, X_2, \ldots, X_n\)` are i.i.d. from `\(N(\mu, \sigma)\)`, then
`$$T =\frac{\bar{X} - \mu}{s/\sqrt n}$$`
where `\(s\)` is the sample standard deviation. `\(T\)` has a Student's t distribution with `\(n-1\)` degrees of freedom
`$$T \sim t_{n-1}$$`

---
### What's Student's t?

- Bell shaped and symmetric around 0
- More spread out: heavier tails, more uncertainty (because we don't know the standard deviation)
- Shape determined by the degrees of freedom.
- As n increases (and hence the degrees of freedom), it tends to the standard normal (in line with the large-sample normal approximation)
  - Less uncertainty because we are better at estimating the standard deviation
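
To see how quickly Student's t approaches the standard normal, one can compare 97.5% critical values across degrees of freedom. This is a small sketch with base R's `qt()` and `qnorm()`; the particular degrees of freedom in `df` are arbitrary choices for illustration.

```r
# Illustrative degrees of freedom (arbitrary choices)
df <- c(4, 9, 29, 99, 999)

# 97.5% critical values: Student's t for each df, and the standard normal
t_crit <- qt(0.975, df)
z_crit <- qnorm(0.975)

# t critical values are always larger than z, and shrink toward z as df grows
data.frame(df = df, t_crit = round(t_crit, 3), z_crit = round(z_crit, 3))
```
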
---
### Student's t critical values

Finding critical values for the Student's t distribution:

1. Determine the right number of degrees of freedom ( `\(n-1\)` )!
2. Determine your confidence level `\((1-\alpha)\)`
    - From this figure out `\(\alpha/2\)`
3. Find the percentile such that
`$$P(T>t_{\frac{\alpha}{2},\underbrace{n-1}_{d.f.}})=\frac{\alpha}{2} \qquad\text{or}\qquad P(T<t_{\frac{\alpha}{2},n-1})=1-\frac{\alpha}{2}$$`

<img src="data:image/png;base64,#C_3_slides_b_files/figure-html/unnamed-chunk-5-1.png" width="100%" />

---
### Example

- Practice in LearnR finding critical values using the code:

.learnr-ref[**LearnR:** Finding critical values in Student's t - compare `qt()` vs `qnorm()`]

```r
# General form: qt(p, df) returns the p-quantile of Student's t with df degrees of freedom
# Example: 97.5% quantile with 9 degrees of freedom
qt(0.975, 9)
```

---

- `\(n=10 \rightarrow df=9\)`
- Confidence level is 95% `\(\rightarrow1-\alpha=0.95\)` and `\(\frac{\alpha}{2}=0.025\)`
- What's `\(t_{0.025,9}\)` such that `\(P(T<t_{0.025,9})=0.975\)`?

<img src="data:image/png;base64,#C_3_slides_b_files/figure-html/unnamed-chunk-6-1.png" width="100%" />

- Once we have the critical value, we construct the CI as before:
`$$\small \left\{\bar{x}- t_{\frac{\alpha}{2},n-1}\cdot\frac{s}{\sqrt{n}},\;\; \bar{x}+ t_{\frac{\alpha}{2},n-1}\cdot\frac{s}{\sqrt{n}}\right\}$$`

---
### Practice: Free Shipping CI

Your company implemented free shipping for a random group of customers. They want to know whether it increased spending. Here is your data:

$157.80, $192.45, $210.20, $175.60, $198.30, $180.90, $205.75, $185.20, $177.40, $195.60

a) Calculate a 90% confidence interval. What assumptions do you need?

--

b) Average spending without free shipping is $182. Can you say anything about whether free shipping increased spending?
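
One possible way to set up part (a) in R, as a sketch rather than the official solution. It assumes the ten observations are an IID sample from an approximately normal population, so that Student's t with `\(n-1\)` degrees of freedom applies; the variable names are illustrative.

```r
# Spending data from the slide
spend <- c(157.80, 192.45, 210.20, 175.60, 198.30,
           180.90, 205.75, 185.20, 177.40, 195.60)

n    <- length(spend)
xbar <- mean(spend)
s    <- sd(spend)

# 90% confidence level -> alpha/2 = 0.05 -> 95% quantile of t with n - 1 df
t_crit <- qt(0.95, df = n - 1)

# Confidence interval: xbar -/+ t_crit * s / sqrt(n)
ci <- xbar + c(-1, 1) * t_crit * s / sqrt(n)
ci
```

The result can be cross-checked against base R's `t.test(spend, conf.level = 0.90)$conf.int`. For part (b), look at whether 182 falls inside the interval.
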
.learnr-ref[**LearnR:** Practice with Student T - compute the CI and draw conclusions]

---
### Confidence Intervals for Variance

- Sometimes we care about how spread out a process is, not just its mean
- Example: in manufacturing, a machine that produces bolts with high variance in diameter is unreliable, even if the average diameter is correct

--

- Suppose `\(X_1, X_2, \ldots, X_n\)` come from a normal distribution
- The sampling distribution of the sample variance `\(S^2=\frac{\sum_i(X_i-\bar{X})^2}{n-1}\)` is:
`$$\small \frac{(n-1)S^2}{\sigma^2}\sim\chi^2_{n-1}$$`

--

- We will use the fact that:
`$$\small P(\chi^2_{0.025,n-1}<\frac{(n-1)S^2}{\sigma^2}<\chi^2_{0.975,n-1})=0.95$$`

<img src="data:image/png;base64,#C_3_slides_b_files/figure-html/unnamed-chunk-7-1.png" width="100%" />

---
### Confidence Intervals for Variance

How do we use it to construct the confidence interval?

`\begin{align*}
0.95&= P\left(\chi^2_{0.025,n-1}<\frac{(n-1)S^2}{\sigma^2}<\chi^2_{0.975,n-1}\right) \\
&= P\left(\frac{1}{\chi^2_{0.975,n-1}} < \frac{\sigma^2}{(n-1)S^2} < \frac{1}{\chi^2_{0.025,n-1}}\right) \\
&= P\left(\frac{(n-1)S^2}{\chi^2_{0.975,n-1}} < \sigma^2 < \frac{(n-1)S^2}{\chi^2_{0.025,n-1}}\right)
\end{align*}`

--

So more generally, the confidence interval for the population variance `\(\sigma^2\)` is
`$$CI_{1-\alpha}=\left\{\frac{(n-1)S^2}{\chi^2_{1-\frac{\alpha}{2},n-1}}, \frac{(n-1)S^2}{\chi^2_{\frac{\alpha}{2},n-1}}\right\}$$`

- Where `\(\chi^2_{1-\frac{\alpha}{2},n-1}\)` and `\(\chi^2_{\frac{\alpha}{2},n-1}\)` are quantiles of the `\(\chi^2_{n-1}\)` distribution
  - `\(P(X<\chi^2_{1-\frac{\alpha}{2},n-1})=1-\frac{\alpha}{2}\)` and `\(P(X<\chi^2_{\frac{\alpha}{2},n-1})=\frac{\alpha}{2}\)`
- You can find them in R using:

```r
# qchisq(p, df) returns the p-quantile; e.g. for alpha = 0.05 and n = 10:
c(qchisq(0.025, df = 9), qchisq(0.975, df = 9))
```

---
### Practice: Sausage Variance CI

Suppose you produce sausages. As quality control, you measure the level of fat in your sausages. You take a random sample of 12 sausages and you find a sample variance of 20 `\(\text{grams}^2\)`.
Find a 99% confidence interval for the population variance. What assumptions do you need?

.learnr-ref[**LearnR:** Practice for CI of variance - use `qchisq()` to find critical values and build the interval]

---
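### Practice: Sausage Variance CI (sketch in R)

A possible way to set up the sausage exercise with `qchisq()`, as a sketch rather than the official solution. It assumes fat content is normally distributed and the 12 sausages are an IID sample; the variable names are illustrative.

```r
n     <- 12     # sample size from the exercise
s2    <- 20     # sample variance in grams^2
alpha <- 0.01   # 99% confidence level

# Quantiles of the chi-squared distribution with n - 1 degrees of freedom
chi_lower <- qchisq(alpha / 2, df = n - 1)
chi_upper <- qchisq(1 - alpha / 2, df = n - 1)

# The larger quantile gives the lower bound, the smaller one the upper bound
ci <- c((n - 1) * s2 / chi_upper, (n - 1) * s2 / chi_lower)
ci
```

The interval is not symmetric around the sample variance, because the chi-squared distribution is skewed.
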